Selection-Based Language Model for Domain Adaptation using Topic Modeling

Authors

  • Tsuyoshi Okita
  • Josef van Genabith
Abstract

This paper introduces a selection-based language model (LM) using topic modeling for the purpose of domain adaptation, which is often required in Statistical Machine Translation. This selection-based LM slightly outperforms the state-of-the-art Moore-Lewis LM, by 1.0% for EN-ES and 0.7% for ES-EN in terms of BLEU. The gain in terms of perplexity was 8% over the Moore-Lewis LM and 17% over the plain LM.

1 Domain Adaptation in Statistical Machine Translation

Domain adaptation is an important research area in Statistical Machine Translation (SMT) as well as in other areas of NLP such as parsing. Domain adaptation tries to ensure that performance does not degrade radically even when we translate a test-set text whose genre differs from that of the parallel corpus used to build the system. Without loss of generality, the decoder of an SMT system can be written in the form of the noisy channel model $\hat{E} = \arg\max_{E} P(E \mid F)\, P_{LM}(E)$, where the two components, the set of $P(E \mid F)$ and that of $P_{LM}(E)$, are the targets on which we do domain adaptation: the set of $P(E \mid F)$ is called a phrase table (or a rule table) and that of $P_{LM}(E)$ is called a language model (for simplicity, the model is written in its simplest form, without indices). Hence, one approach to domain adaptation in SMT aims at obtaining a domain-adapted phrase table and language model [1]. In particular, several papers observe that domain adaptation of the language model is often the most effective route. In this context, we explore domain-adapted language models using topic modeling [2]. Note that there is an alternative approach which applies transfer learning [3] for domain adaptation; it is not pursued in this paper. In the following, we focus on domain adaptation of language models and leave translation-model domain adaptation as further work. The setting specific to SMT is the following: (1) if a training corpus becomes big, e.g. more than a million sentences, we may need to think of the corpus as a combination of different genres, and (2) we may have some information about the genre of the test set as a whole or of each sentence (it is rare that we have no information about the genre of the test set).

2 Selection-Based Language Models

Let us prepare n kinds of language models $\{P_{LM_1}, \ldots, P_{LM_n}\}$ (we sometimes call this "a pool of language models" in the following) and a selection function $f(s)$, where $s$ denotes a test sentence and ...

(Footnote: topic modeling for the translation model can be found in [4]; the main differences are the usage of cross-entropy and interpolation. Topic modeling for system combination in SMT can be found in [5].)
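As an illustration of the pool-plus-selection-function idea sketched above, the following is a minimal sketch assuming LDA (via gensim) as the topic model and add-one-smoothed unigram LMs standing in for real n-gram models; the names build_pool and select_lm, and the choice of minimum perplexity as the selection function f(s), are illustrative assumptions, not taken from the paper.

    # Sketch only: cluster training sentences by dominant LDA topic, train one
    # simple LM per cluster ("pool of language models"), and select a pool LM
    # per test sentence by minimum perplexity. gensim and the unigram LM are
    # stand-ins; the paper does not prescribe these particular tools.
    from collections import Counter
    import math

    from gensim import corpora, models


    class UnigramLM:
        """Add-one-smoothed unigram LM (a stand-in for a real n-gram LM)."""

        def __init__(self, sentences):
            self.counts = Counter(w for s in sentences for w in s)
            self.total = sum(self.counts.values())
            self.vocab = len(self.counts) + 1  # +1 mass for unseen words

        def logprob(self, sentence):
            return sum(
                math.log((self.counts[w] + 1) / (self.total + self.vocab))
                for w in sentence
            )

        def perplexity(self, sentence):
            return math.exp(-self.logprob(sentence) / max(len(sentence), 1))


    def build_pool(train_sents, num_topics=4):
        """Cluster training sentences by their dominant LDA topic and train
        one LM per cluster, giving the pool {P_LM1, ..., P_LMn}."""
        dictionary = corpora.Dictionary(train_sents)
        bows = [dictionary.doc2bow(s) for s in train_sents]
        lda = models.LdaModel(bows, num_topics=num_topics, id2word=dictionary)

        clusters = [[] for _ in range(num_topics)]
        for sent, bow in zip(train_sents, bows):
            topic = max(lda.get_document_topics(bow), key=lambda t: t[1])[0]
            clusters[topic].append(sent)
        return [UnigramLM(c) for c in clusters if c]


    def select_lm(pool, test_sentence):
        """Selection function f(s): one possible choice is the pool LM with
        the lowest perplexity on the test sentence."""
        return min(pool, key=lambda lm: lm.perplexity(test_sentence))


    if __name__ == "__main__":
        # Toy mixed-genre training set.
        train_sents = [
            "the parliament adopted the resolution".split(),
            "the committee approved the directive".split(),
            "the patient received the treatment".split(),
            "the doctor prescribed the medicine".split(),
        ]
        pool = build_pool(train_sents, num_topics=2)
        test = "the council adopted the directive".split()
        print("selected LM perplexity:", select_lm(pool, test).perplexity(test))

Other realizations of f(s) fit the same interface, for example selecting by the topic affinity of s under the LDA model or by cross-entropy against each sub-corpus.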


Similar Articles

An LDA-based Topic Selection Approach to Language Model Adaptation for Handwritten Text Recognition

Typically, only a very limited amount of in-domain data is available for training the language model component of a Handwritten Text Recognition (HTR) system for historical data. One has to rely on a combination of in-domain and out-of-domain data to develop language models. Accordingly, domain adaptation is a central issue in language modeling for HTR. We pursue a topic modeling approach to ha...

Automatic transcription of lecture speech using topic-independent language modeling

We approach lecture speech recognition with a topic-independent language model and its adaptation. As lecture speech has a characteristic style that differs from newspapers and conversations, dedicated language modeling is needed. The problem is that, although lectures contain many keywords specific to their topic and field, the available corpus for each domain is limited in size. Thus, we introdu...

Unsupervised topic adaptation for morph-based speech recognition

Topic adaptation in automatic speech recognition (ASR) refers to the adaptation of the language model and vocabulary for improved recognition of in-domain speech data. In this work we implement unsupervised topic adaptation for morph-based ASR, to improve recognition of foreign entity names. Based on the first-pass ASR hypothesis, similar texts are selected from a collection of articles, which are used ...

Bridging the Language Gap: Topic Adaptation for Documents with Different Technicality

The language gap, for example between low-literacy laypersons and highly technical expert documents, is a fundamental barrier to cross-domain knowledge transfer. This paper seeks to close the gap at the thematic level via topic adaptation, i.e., adjusting the topical structures of cross-domain documents according to a domain factor such as technicality. We present a probabilistic model for thi...

Topic adaptation for language modeling using unnormalized exponential models

In this paper, we present novel techniques for performing topic adaptation on an n-gram language model. Given training text labeled with topic information, we automatically identify the most relevant topics for new text. We adapt our language model toward these topics using an exponential model, by adjusting probabilities in our model to agree with those found in the topical subset of the train...
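The exponential-model adaptation described here can be roughly illustrated as a log-linear rescaling of background probabilities toward a topical subset of the training text; the sketch below applies this idea to unigrams with a hypothetical weight alpha, and is not the exact model of the cited paper.

    # Sketch only: score(w) = p_bg(w) * (p_topic(w) / p_bg(w)) ** alpha,
    # then renormalize over the vocabulary. `alpha` and the add-one smoothing
    # are illustrative assumptions, not taken from the cited work.
    from collections import Counter


    def adapted_unigrams(background_counts, topic_counts, alpha=0.5):
        bg_total = sum(background_counts.values())
        tp_total = sum(topic_counts.values())
        vocab = set(background_counts) | set(topic_counts)

        scores = {}
        for w in vocab:
            p_bg = (background_counts[w] + 1) / (bg_total + len(vocab))
            p_tp = (topic_counts[w] + 1) / (tp_total + len(vocab))
            scores[w] = p_bg * (p_tp / p_bg) ** alpha

        z = sum(scores.values())
        return {w: s / z for w, s in scores.items()}


    background = Counter("the meeting discussed the budget and the schedule".split())
    topical = Counter("the patient discussed the treatment with the doctor".split())
    print(sorted(adapted_unigrams(background, topical).items(),
                 key=lambda kv: -kv[1])[:5])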


Publication date: 2013